W9. Probabilistic Analysis and Randomized Algorithms

Author

Nikolai Kudasov

Published

March 18, 2026

1. Summary

1.1 Probability Basics
1.1.1 Sample Spaces and Events

To reason precisely about randomness, we need a formal vocabulary. The starting point is the sample space \(S\): the set of all possible outcomes of an experiment. For example, when tossing a coin three times, every sequence of heads (H) and tails (T) of length three is a possible outcome:

\[S = \{HHH, HHT, HTH, HTT, THH, THT, TTH, TTT\}\]

An event is any subset of the sample space. For instance, the event “exactly one tail” corresponds to:

\[P_1 = \{HHT, HTH, THH\}\]

This formal setup lets us reason carefully about what can happen and how likely each outcome is.
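
This setup is easy to check mechanically. A minimal Python sketch (names are illustrative) that enumerates the sample space and the event above:

```python
from itertools import product

# Sample space: all length-3 sequences of H and T
S = {"".join(seq) for seq in product("HT", repeat=3)}

# Event: outcomes with exactly one tail
P1 = {s for s in S if s.count("T") == 1}

print(sorted(P1))  # ['HHT', 'HTH', 'THH']
```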

1.1.2 Axioms of Probability

A probability distribution is a function \(\text{Pr} : \mathcal{P}(S) \rightarrow \mathbb{R}\) (where \(\mathcal{P}(S)\) is the set of all subsets of \(S\)) that assigns a real number to every event and satisfies four axioms:

  1. \(\text{Pr}\{A\} \geq 0\) for any event \(A\) — probabilities are non-negative.
  2. \(\text{Pr}\{S\} = 1\) — the probability of some outcome happening is 1.
  3. \(\text{Pr}\{A \cup B\} = \text{Pr}\{A\} + \text{Pr}\{B\}\) for any two mutually exclusive events \(A\) and \(B\) (i.e., \(A \cap B = \emptyset\)).
  4. \(\text{Pr}\left\{\bigcup_i A_i\right\} = \sum_i \text{Pr}\{A_i\}\) for any finite or countably infinite sequence of pairwise mutually exclusive events.

\(\text{Pr}\{A\}\) is called the probability of event \(A\).

1.1.3 Basic Properties of Probability

The following properties follow directly from the axioms:

  1. \(\text{Pr}\{\emptyset\} = 0\)

    Proof: \(\emptyset \cap S = \emptyset\), so \(\text{Pr}\{\emptyset \cup S\} = \text{Pr}\{\emptyset\} + \text{Pr}\{S\}\). Since \(\emptyset \cup S = S\), we get \(\text{Pr}\{S\} = \text{Pr}\{\emptyset\} + \text{Pr}\{S\}\), hence \(\text{Pr}\{\emptyset\} = 0\).

  2. \(\text{Pr}\{\overline{A}\} = 1 - \text{Pr}\{A\}\) where \(\overline{A} = S \setminus A\) is the complement of \(A\).

    Proof: \(A \cap \overline{A} = \emptyset\) and \(A \cup \overline{A} = S\), so \(\text{Pr}\{A\} + \text{Pr}\{\overline{A}\} = \text{Pr}\{S\} = 1\).

  3. \(\text{Pr}\{A \cup B\} = \text{Pr}\{A\} + \text{Pr}\{B\} - \text{Pr}\{A \cap B\} \leq \text{Pr}\{A\} + \text{Pr}\{B\}\)

    Proof: We can split \(B\) into the disjoint parts \(A \cap B\) and \(B \setminus (A \cap B)\), so \(\text{Pr}\{B\} = \text{Pr}\{A \cap B\} + \text{Pr}\{B \setminus (A \cap B)\}\). Since \(A \cup B = A \cup (B \setminus (A \cap B))\) with the two parts disjoint, \(\text{Pr}\{A \cup B\} = \text{Pr}\{A\} + \text{Pr}\{B \setminus (A \cap B)\} = \text{Pr}\{A\} + \text{Pr}\{B\} - \text{Pr}\{A \cap B\}\).

Example (coin tossing): Tossing a fair coin 3 times. Assuming all 8 outcomes are equally likely, each has probability \(\frac{1}{8}\). The event \(P_1 = \{HHT, HTH, THH\}\) has probability \(\text{Pr}\{P_1\} = \frac{3}{8}\).
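
Under the equally-likely assumption, the complement and inclusion-exclusion properties can be verified by brute force. A small Python sketch (the two events chosen here are illustrative):

```python
from itertools import product
from fractions import Fraction

S = ["".join(seq) for seq in product("HT", repeat=3)]

def pr(event):
    # Uniform distribution: Pr{A} = |A| / |S|
    return Fraction(len(event), len(S))

A = {s for s in S if s.count("T") == 1}  # exactly one tail
B = {s for s in S if s[0] == "H"}        # first toss is heads

complement_ok = pr(set(S) - A) == 1 - pr(A)             # complement rule
union_ok = pr(A | B) == pr(A) + pr(B) - pr(A & B)       # inclusion-exclusion
```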

1.2 Discrete Probability Distributions

A probability distribution is called discrete if it is defined on a finite or countably infinite sample space.

A special case is the uniform distribution: if \(S\) is finite and every outcome \(s \in S\) has probability \(\text{Pr}\{s\} = \frac{1}{|S|}\), the distribution is uniform. We say “we pick an element of \(S\) uniformly at random.”

Examples:

  • Rolling a fair six-sided die: \(S = \{1, 2, 3, 4, 5, 6\}\) with each outcome having probability \(\frac{1}{6}\). This is uniform.
  • Sum of two fair dice: The sum ranges from 2 to 12, but not all sums are equally likely (e.g., a sum of 7 is more likely than 2). This is not uniform.
  • Choosing a random student ID from a class of \(n\) students: uniform (each ID equally likely).
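
The second example can be checked by enumerating all 36 pairs. A quick Python sketch:

```python
from itertools import product
from collections import Counter

# All 36 equally likely (die1, die2) pairs, collapsed to their sum
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
```

The counts show, e.g., only one pair summing to 2 but six pairs summing to 7, so the distribution of the sum is not uniform.
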

1.3 Discrete Random Variables

A discrete random variable \(X\) is a function \(X : S \rightarrow \mathbb{R}\) that maps each outcome to a real number. For example, \(X\) might represent the score in a game or the number of heads in a coin-toss experiment.

We define: \[\text{Pr}\{X = x\} = \sum_{s \in S : X(s) = x} \text{Pr}\{s\}\]

The function \(f(x) = \text{Pr}\{X = x\}\) is called the probability mass function of \(X\).

1.3.1 Expected Value

The expected value (also called the expectation or mean) of a random variable \(X\) is its probability-weighted average:

\[E[X] = \sum_x x \cdot \text{Pr}\{X = x\}\]

The expected value tells you what the “average” outcome would be if you repeated the experiment many times. For example, if you toss a fair coin and gain 1 point for heads and 0 for tails, the expected value is \(\frac{1}{2}\): on average, you gain half a point per toss.

Example: In a game where tossing two coins gives \(+6\) for HH, \(+1\) for HT or TH, and \(-4\) for TT:

\[E[X] = 6 \cdot \frac{1}{4} + 1 \cdot \frac{1}{2} + (-4) \cdot \frac{1}{4} = \frac{6}{4} + \frac{2}{4} - \frac{4}{4} = 1\]
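
The same expectation can be computed by enumerating the four outcomes. A minimal Python sketch (the payoff function mirrors the game above):

```python
from itertools import product
from fractions import Fraction

def payoff(toss):
    if toss == ("H", "H"):
        return 6
    if toss == ("T", "T"):
        return -4
    return 1  # HT or TH

# Four equally likely outcomes, each with probability 1/4
E = sum(Fraction(payoff(t), 4) for t in product("HT", repeat=2))
```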

1.3.2 Linearity of Expected Value

Expected value is a linear operator, meaning it satisfies:

\[E[X + Y] = E[X] + E[Y]\] \[E[\alpha X] = \alpha E[X] \text{ for any scalar constant } \alpha\]

This holds for any random variables \(X\) and \(Y\), not just independent ones. Linearity is the most-used property in probabilistic analysis, since it lets us split complex random variables into simple indicator variables.

If \(X\) is a random variable and \(g\) is any function, then \(g(X)\) is also a random variable and: \[E[g(X)] = \sum_x g(x) \cdot \text{Pr}\{X = x\}\]

Note that in general \(E[g(X)] \neq g(E[X])\) — for instance, \(E[X^2] \neq (E[X])^2\) unless \(X\) is constant.
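
Both facts are easy to verify on a tiny distribution. A minimal Python sketch (taking \(Y = X\) is deliberately an extreme case of dependence):

```python
from fractions import Fraction

# X uniform on {0, 1, 2}; take Y = X, a maximally dependent pair
vals = [0, 1, 2]
p = Fraction(1, 3)

EX = sum(p * x for x in vals)              # E[X] = 1
E_sum = sum(p * (x + x) for x in vals)     # E[X + Y] with Y = X
EX2 = sum(p * x * x for x in vals)         # E[X^2] = 5/3, not (E[X])^2 = 1
```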

When does \(E[X]\) fail to exist? The expected value is defined as a sum \(\sum_x x \cdot \text{Pr}\{X = x\}\). This sum may diverge — grow without bound — in which case we say the expected value does not exist. A classic example is the St. Petersburg paradox: a game where you toss a fair coin repeatedly and win \(2^k\) dollars if the first heads appears on toss \(k\). The expected winnings are \(\sum_{k=1}^{\infty} 2^k \cdot \frac{1}{2^k} = \sum_{k=1}^{\infty} 1 = \infty\) — the expected value is infinite, so it “does not exist” in the usual finite sense. In algorithm analysis we always deal with finite, well-behaved distributions, so this issue does not arise in practice.

1.3.3 Independent Random Variables

Random variables \(X\) and \(Y\) are independent if knowing the value of one gives no information about the value of the other. Formally, \(X\) and \(Y\) are independent if and only if:

\[\text{Pr}\{X = x \text{ and } Y = y\} = \text{Pr}\{X = x\} \cdot \text{Pr}\{Y = y\}\]

for all values \(x\) and \(y\).

Example of dependent variables: Let \(S = \{H, T\}\) (one coin toss). Define \(X(H) = 1, X(T) = 0\) and \(Y(H) = 0, Y(T) = 1\). Then \(\text{Pr}\{X = 1 \text{ and } Y = 1\} = 0\), but \(\text{Pr}\{X = 1\} \cdot \text{Pr}\{Y = 1\} = \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4} \neq 0\). So \(X\) and \(Y\) are dependent: if you know \(X = 1\) (heads), you know for certain \(Y = 0\).

Multiplication rule for independent variables: If \(X\) and \(Y\) are independent, then: \[E[XY] = E[X] \cdot E[Y]\]

Unlike linearity of expectation, which holds for arbitrary random variables, this multiplication rule requires \(X\) and \(Y\) to be independent.

1.4 Indicator Random Variables

An indicator random variable (also called a Bernoulli random variable) for event \(A\) is defined as:

\[I\{A\} = \begin{cases} 1 & \text{if event } A \text{ occurs,} \\ 0 & \text{if event } A \text{ does not occur.} \end{cases}\]

Indicator variables are the main tool for converting probability statements into expected value calculations. The key lemma connects them:

Lemma (Cormen et al. 2022, Lemma 5.1). Let \(X_A = I\{A\}\). Then \(E[X_A] = \text{Pr}\{A\}\).

Proof: \[E[X_A] = 1 \cdot \text{Pr}\{A\} + 0 \cdot \text{Pr}\{\overline{A}\} = \text{Pr}\{A\} \quad \square\]

Why are indicator variables useful? They let us decompose complex random variables into sums of simple ones. If \(X = X_1 + X_2 + \cdots + X_n\) where each \(X_i = I\{A_i\}\), then by linearity:

\[E[X] = \sum_{i=1}^n E[X_i] = \sum_{i=1}^n \text{Pr}\{A_i\}\]

This is much easier to compute than reasoning directly about the distribution of \(X\).

Example (expected number of heads): When tossing a fair coin \(n\) times, let \(X_i = I\{\text{the } i\text{-th toss is heads}\}\) and \(X = \sum_{i=1}^n X_i\) be the total number of heads. Then:

\[E[X] = \sum_{i=1}^n E[X_i] = \sum_{i=1}^n \frac{1}{2} = \frac{n}{2}\]
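
The indicator-based shortcut can be checked against brute-force enumeration. A small Python sketch for \(n = 6\):

```python
from itertools import product
from fractions import Fraction

n = 6

# Linearity: E[X] = sum of n indicator expectations, each 1/2
via_indicators = n * Fraction(1, 2)

# Brute force: average number of heads over all 2^n equally likely outcomes
outcomes = ["".join(o) for o in product("HT", repeat=n)]
via_enumeration = Fraction(sum(o.count("H") for o in outcomes), len(outcomes))
```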

1.5 Probabilistic Analysis

Probabilistic analysis is the study of an algorithm’s behavior (especially its running time) using probability theory. Instead of asking “what is the worst possible running time?”, we ask “what is the expected running time averaged over all possible inputs?”

To apply probabilistic analysis, we need:

  1. A known or assumed distribution of inputs.
  2. Justified assumptions that the distribution is realistic for the use case.
  3. Computation of the expected running time by averaging over all inputs under that distribution.

If we cannot justify a reasonable input distribution, probabilistic analysis may not be applicable.

1.5.1 The Hiring Problem

Imagine you are hiring an AI agent for a task. There are \(n\) candidates, and you evaluate them one by one. Evaluating a candidate costs \(c_c\), and hiring (replacing the current agent) costs \(c_h > c_c\).

HIRE-ASSISTANT(n)
1  best = 0         // candidate 0 is a least-qualified dummy
2  for i = 1 to n
3      interview candidate i      // cost c_c
4      if candidate i is better than candidate best
5          best = i
6          hire candidate i       // cost c_h

Worst case: Candidates arrive in increasing order of quality — we hire every candidate. Total hiring cost: \(c_h n\).

Best case: The best candidate arrives first — we hire only once. Total hiring cost: \(c_h\).

Probabilistic analysis: Assume candidates arrive in a uniformly random order (each of the \(n!\) permutations equally likely). Let \(X_i = I\{\text{the } i\text{-th candidate is hired}\}\). The \(i\)-th candidate is hired exactly when they are the best among the first \(i\) candidates. Since the order is random, any of the first \(i\) candidates is equally likely to be the best, so:

\[\text{Pr}\{\text{the } i\text{-th candidate is hired}\} = \frac{1}{i}\]

The expected number of hires is:

\[E[X] = E\left[\sum_{i=1}^n X_i\right] = \sum_{i=1}^n E[X_i] = \sum_{i=1}^n \frac{1}{i} = \Theta(\log n)\]

This uses the fact that \(\sum_{i=1}^n \frac{1}{i} = H_n\) (the \(n\)-th harmonic number), which grows as \(\Theta(\log n)\). So on average, we expect only \(\Theta(\log n)\) expensive hires even though we evaluate all \(n\) candidates.
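
A quick simulation illustrates this: the average number of hires over many random candidate orders should land near \(H_n\). A Python sketch (the parameters are arbitrary):

```python
import random

def hires(ranks):
    """Number of times HIRE-ASSISTANT hires, given candidate ranks in arrival order."""
    best, count = float("-inf"), 0
    for r in ranks:
        if r > best:          # strictly better than the current best: hire
            best, count = r, count + 1
    return count

random.seed(0)
n, trials = 200, 2000
avg = sum(hires(random.sample(range(n), n)) for _ in range(trials)) / trials
H_n = sum(1 / i for i in range(1, n + 1))  # harmonic number, about ln n + 0.577
```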

1.6 Randomized Algorithms

The probabilistic analysis above required assuming a uniform random input. But what if candidates arrive in a specific (possibly adversarial) order? A randomized algorithm takes control of this uncertainty: instead of hoping the input is random, it makes the input random.

Definition: An algorithm is randomized if its behavior depends not only on the input but also on random choices made during execution (values from a random-number generator).

Randomized hiring: Before running HIRE-ASSISTANT, randomly shuffle the candidates using:

RANDOMLY-PERMUTE(A, n)
1  for i = 1 to n
2      swap A[i] with A[RANDOM(i, n)]

This produces a uniformly random permutation of the candidates. Now, regardless of the original input order, we can guarantee that the expected number of hires is \(\Theta(\log n)\) — even against an adversary who chose the worst possible original order.
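
RANDOMLY-PERMUTE is the Fisher-Yates shuffle. A Python version, plus a rough empirical check that all orderings of a small list appear about equally often (trial counts are arbitrary):

```python
import random
from collections import Counter

def randomly_permute(a):
    """In-place uniform shuffle (Fisher-Yates), mirroring RANDOMLY-PERMUTE."""
    for i in range(len(a)):
        j = random.randint(i, len(a) - 1)  # RANDOM(i, n), 0-indexed here
        a[i], a[j] = a[j], a[i]

# Each of the 3! = 6 orderings of a 3-element list should appear ~1000 times
random.seed(1)
counts = Counter()
for _ in range(6000):
    a = [0, 1, 2]
    randomly_permute(a)
    counts[tuple(a)] += 1
```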

The key insight: probabilistic analysis relies on the input being random; randomized algorithms rely on the algorithm itself making random choices.

1.7 Randomized Quicksort

Quicksort is a classic divide-and-conquer sorting algorithm with the following properties:

  1. It is a divide-and-conquer algorithm.
  2. It runs in \(\Theta(n \log n)\) on average but \(\Theta(n^2)\) in the worst case.
  3. It is not stable (equal elements may not keep their relative order after partitioning).
  4. It has an efficient in-place implementation (no extra array needed).

The idea of the algorithm:

  • Divide: Partition the array around a pivot element — put all elements smaller than the pivot before it, all larger elements after it.
  • Conquer: A single-element subarray is already sorted.
  • Combine: Concatenate the sorted subarrays with the pivot in the middle.

The worst case occurs when, e.g., the array is already sorted and we always pick the first or last element as pivot.

Randomized pivot selection avoids the worst case by choosing the pivot uniformly at random:

RANDOMIZED-PARTITION(A, p, r)
1  i = RANDOM(p, r)
2  exchange A[r] with A[i]
3  return PARTITION(A, p, r)

RANDOMIZED-QUICKSORT(A, p, r)
1  if p < r
2      q = RANDOMIZED-PARTITION(A, p, r)
3      RANDOMIZED-QUICKSORT(A, p, q - 1)
4      RANDOMIZED-QUICKSORT(A, q + 1, r)
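
A compact Python version of the two procedures above, using the Lomuto partition scheme from CLRS (merged into one function here for brevity):

```python
import random

def randomized_quicksort(a, p=0, r=None):
    """Sort a[p..r] in place; the pivot is chosen uniformly at random."""
    if r is None:
        r = len(a) - 1
    if p < r:
        i = random.randint(p, r)     # RANDOMIZED-PARTITION: pick a random pivot
        a[i], a[r] = a[r], a[i]
        pivot, q = a[r], p
        for j in range(p, r):        # PARTITION: move elements <= pivot left
            if a[j] <= pivot:
                a[q], a[j] = a[j], a[q]
                q += 1
        a[q], a[r] = a[r], a[q]      # place the pivot in its final position
        randomized_quicksort(a, p, q - 1)
        randomized_quicksort(a, q + 1, r)

random.seed(0)
data = [5, 3, 8, 1, 9, 2, 7, 4, 6, 0]
randomized_quicksort(data)
```
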

1.7.1 Analysis of Randomized Quicksort

Lemma (Cormen et al. 2022, Lemma 7.1). If \(X\) is the total number of comparisons in PARTITION across the entire execution, then the running time of QUICKSORT is \(O(n + X)\).

To estimate the expected number of comparisons, let \(z_1 < z_2 < \cdots < z_n\) be the elements in sorted order. Define:

\[X_{ij} = I\{z_i \text{ is compared with } z_j\}\]

Then \(X = \sum_{i=1}^n \sum_{j=i+1}^n X_{ij}\).

Key observation: Two elements \(z_i\) and \(z_j\) are compared if and only if one of them is the first pivot chosen from the set \(Z_{ij} = \{z_i, z_{i+1}, \ldots, z_j\}\). This is because:

  • If a pivot \(z_k\) with \(i < k < j\) is chosen before either \(z_i\) or \(z_j\), they are separated into different subproblems and will never be compared.
  • If \(z_i\) or \(z_j\) is chosen as pivot first (before any \(z_k\) with \(i < k < j\)), they are compared directly.

Since each of the \(j - i + 1\) elements in \(Z_{ij}\) is equally likely to be the first pivot from that set:

\[\text{Pr}\{z_i \text{ is compared with } z_j\} = \frac{2}{j - i + 1}\]

Therefore, the expected total number of comparisons is:

\[E[X] = \sum_{i=1}^n \sum_{j=i+1}^n \frac{2}{j-i+1} = \sum_{i=1}^n \sum_{k=1}^{n-i} \frac{2}{k+1} < \sum_{i=1}^n \sum_{k=1}^{n-i} \frac{2}{k} = \sum_{i=1}^n O(\log n) = O(n \log n)\]
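
The inequality in this chain can be confirmed numerically. A short Python sketch comparing the exact double sum with the \(2n H_n\) upper bound (\(n = 200\) is arbitrary):

```python
# Exact value of the double sum versus the 2n * H_n upper bound
n = 200
expected_comparisons = sum(2 / (j - i + 1)
                           for i in range(1, n + 1)
                           for j in range(i + 1, n + 1))
H_n = sum(1 / k for k in range(1, n + 1))
upper_bound = 2 * n * H_n
```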

Summary of randomized quicksort:

  • Works just like ordinary quicksort, but chooses the pivot randomly.
  • Expected running time: \(\Theta(n \log n)\) on any input.
  • Worst-case running time: still \(\Theta(n^2)\), but now it depends on the random number generator, not the input — an adversary cannot force the worst case by choosing a bad input.

1.8 Probabilistic Analysis of Bucket Sort

Bucket sort is a linear-time sorting algorithm that works under a specific assumption: the input values are uniformly distributed over some interval (typically \([0, 1)\)).

Algorithm:

  1. Divide \([0, 1)\) into \(n\) equal-sized\(^*\) subintervals (“buckets”). Each subinterval has the same probability of receiving an element, since the input is uniform.
  2. Assign each input element to the bucket corresponding to its subinterval: element \(a\) goes to bucket \(\lfloor na \rfloor\) (a cheap rounding operation).
  3. Sort each bucket using insertion sort.
  4. Concatenate the sorted buckets.

\(^*\)“Equal” here means equal in probability: each of the \(n\) buckets covers an interval of length \(\frac{1}{n}\), and since the input is uniformly distributed, each element independently falls into any given bucket with probability \(\frac{1}{n}\).

Why does this work fast? If the input is uniformly distributed, each bucket expects to receive \(\frac{n}{n} = 1\) element on average, so the insertion sort within each bucket takes \(O(1)\) expected time.
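
The four steps above can be sketched in Python as follows (a minimal implementation, assuming inputs in \([0, 1)\)):

```python
import random

def insertion_sort(b):
    for i in range(1, len(b)):
        key, j = b[i], i - 1
        while j >= 0 and b[j] > key:
            b[j + 1] = b[j]
            j -= 1
        b[j + 1] = key

def bucket_sort(a):
    n = len(a)
    buckets = [[] for _ in range(n)]
    for x in a:
        buckets[int(n * x)].append(x)  # element x in [0,1) goes to bucket floor(n*x)
    for b in buckets:                  # each bucket holds O(1) elements in expectation
        insertion_sort(b)
    return [x for b in buckets for x in b]

random.seed(2)
data = [random.random() for _ in range(1000)]
result = bucket_sort(data)
```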

1.8.1 Formal Analysis

Let \(n_i\) be the number of elements in bucket \(i\). The running time of bucket sort is:

\[T(n) = \Theta(n) + \sum_{i=0}^{n-1} O(n_i^2)\]

(The \(\Theta(n)\) covers distributing elements into buckets; \(O(n_i^2)\) is the insertion sort cost for bucket \(i\).)

We want to compute \(E\left[\sum_{i=0}^{n-1} O(n_i^2)\right] = \sum_{i=0}^{n-1} O(E[n_i^2])\).

Let \(X_{ij} = I\{A[j] \text{ falls into bucket } i\}\), so \(n_i = \sum_{j=1}^n X_{ij}\).

Computing \(E[n_i^2]\):

\[E[n_i^2] = E\left[\left(\sum_{j=1}^n X_{ij}\right)^2\right] = E\left[\sum_{j=1}^n X_{ij}^2 + \sum_{j \neq k} X_{ij} X_{ik}\right] = \sum_{j=1}^n E[X_{ij}^2] + \sum_{j \neq k} E[X_{ij} X_{ik}]\]

First sum: Since \(X_{ij}\) is a \(\{0,1\}\)-valued indicator, \(X_{ij}^2 = X_{ij}\) (squaring 0 or 1 gives the same value). Therefore: \[E[X_{ij}^2] = 1^2 \cdot \frac{1}{n} + 0^2 \cdot \left(1 - \frac{1}{n}\right) = \frac{1}{n}\]

Second sum: For \(j \neq k\), the events “\(A[j]\) falls into bucket \(i\)” and “\(A[k]\) falls into bucket \(i\)” are independent — because \(A[j]\) and \(A[k]\) are independent uniform random variables (distinct input elements), so knowing where one falls gives no information about the other. By the multiplication rule for independent variables: \[E[X_{ij} X_{ik}] = E[X_{ij}] \cdot E[X_{ik}] = \frac{1}{n} \cdot \frac{1}{n} = \frac{1}{n^2}\]

There are \(n\) terms in the first sum and \(n(n-1)\) terms in the second sum, so:

\[E[n_i^2] = n \cdot \frac{1}{n} + n(n-1) \cdot \frac{1}{n^2} = 1 + \frac{n-1}{n} = 2 - \frac{1}{n}\]

Therefore, the expected total running time is:

\[E[T(n)] = \Theta(n) + n \cdot O\!\left(2 - \frac{1}{n}\right) = \Theta(n)\]

Bucket sort runs in \(\Theta(n)\) expected time when the input is uniformly distributed.
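
The value \(E[n_i^2] = 2 - \frac{1}{n}\) can be checked by simulation: throw \(n\) uniform elements into \(n\) buckets and average the squared size of one bucket. A Python sketch (seed and trial count are arbitrary):

```python
import random

random.seed(3)
n, trials = 50, 4000
total = 0.0
for _ in range(trials):
    counts = [0] * n
    for _ in range(n):                # throw n uniform elements into n buckets
        counts[random.randrange(n)] += 1
    total += counts[0] ** 2           # squared size of bucket 0
estimate = total / trials
theory = 2 - 1 / n                    # E[n_i^2] = 2 - 1/n = 1.98 for n = 50
```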

1.9 Randomly Built Binary Search Trees

A binary search tree (BST) is a binary tree satisfying the BST property: for every node \(x\), all keys in its left subtree are \(\leq x.\text{key}\), and all keys in its right subtree are \(\geq x.\text{key}\).

The shape (and thus efficiency) of a BST depends entirely on the order in which keys are inserted. When keys are inserted in a uniformly random order, the resulting tree is called a randomly built BST.

1.9.1 Two Different Notions of “Random BST”

It is important to distinguish two concepts:

  1. Randomly chosen BST: Pick uniformly at random among all valid BST shapes on \(n\) nodes. The number of such shapes is the \(n\)-th Catalan number \(C_n = \frac{1}{n+1}\binom{2n}{n}\).
  2. Randomly built BST: Start with an empty tree and insert a uniformly random permutation of \(n\) distinct keys one by one. Each of the \(n!\) permutations is equally likely, but different permutations can produce the same tree shape, so this does not give a uniform distribution over tree shapes.

These two notions differ: for \(n = 5\) there are 42 BST shapes (Catalan number \(C_5 = 42\)) but \(5! = 120\) permutations. Several permutations can build the same bushy shape, while a maximally skewed chain is built by exactly one permutation. So under random insertion, balanced shapes are more likely, and tall, skewed trees less likely, than under uniform random shape selection.
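
The mismatch is already visible for \(n = 3\): there are \(C_3 = 5\) shapes but \(3! = 6\) permutations, so at least one shape must arise more often than the others. A Python sketch that enumerates all permutations (the shape encoding is illustrative):

```python
from itertools import permutations
from collections import Counter

def bst_shape(keys):
    """Insert keys in order into a BST; return a nested-tuple encoding of the shape."""
    root = None
    for k in keys:
        if root is None:
            root = [k, None, None]
            continue
        node = root
        while True:
            side = 1 if k < node[0] else 2   # 1 = left child, 2 = right child
            if node[side] is None:
                node[side] = [k, None, None]
                break
            node = node[side]

    def shape(t):
        return None if t is None else (shape(t[1]), shape(t[2]))

    return shape(root)

# 6 permutations of {1,2,3} collapse onto 5 shapes; the balanced one appears twice
shapes = Counter(bst_shape(p) for p in permutations([1, 2, 3]))
```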

Expected height of a randomly built BST: It is known (Cormen et al. 2022, §12.4) that the expected height of a randomly built BST on \(n\) nodes is \(\Theta(\log n)\) — the same asymptotic height as a perfectly balanced tree.
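
A simulation makes the \(\Theta(\log n)\) height plausible: for \(n = 1024\) a perfectly balanced tree has height 10, and randomly built trees should average only a small constant factor more. A Python sketch (parameters are arbitrary):

```python
import random
from math import log2

def built_height(keys):
    """Height (in edges) of the BST obtained by inserting keys in the given order."""
    root = None
    height = 0
    for k in keys:
        if root is None:
            root = [k, None, None]
            continue
        node, depth = root, 1
        while True:
            side = 1 if k < node[0] else 2
            if node[side] is None:
                node[side] = [k, None, None]
                height = max(height, depth)
                break
            node, depth = node[side], depth + 1
    return height

random.seed(4)
n, trials = 1024, 50
avg = 0.0
for _ in range(trials):
    keys = list(range(n))
    random.shuffle(keys)
    avg += built_height(keys) / trials
```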


2. Definitions

  • Sample space (\(S\)): The set of all possible outcomes of a random experiment.
  • Event: Any subset of the sample space \(S\).
  • Probability distribution: A function \(\text{Pr} : \mathcal{P}(S) \rightarrow \mathbb{R}\) satisfying the four axioms of probability; assigns a real number to every event.
  • Discrete probability distribution: A probability distribution defined on a finite or countably infinite sample space.
  • Uniform distribution: A discrete distribution on a finite sample space where every outcome has equal probability \(\frac{1}{|S|}\).
  • Discrete random variable: A function \(X : S \rightarrow \mathbb{R}\) mapping outcomes to real numbers on a finite or countably infinite sample space.
  • Probability mass function: The function \(f(x) = \text{Pr}\{X = x\}\) describing the probability distribution of a discrete random variable \(X\).
  • Expected value (\(E[X]\)): The probability-weighted average of a random variable: \(E[X] = \sum_x x \cdot \text{Pr}\{X = x\}\).
  • Linearity of expectation: The property \(E[X + Y] = E[X] + E[Y]\), which holds for any random variables regardless of independence.
  • Independent random variables: Random variables \(X\) and \(Y\) such that \(\text{Pr}\{X = x \text{ and } Y = y\} = \text{Pr}\{X = x\} \cdot \text{Pr}\{Y = y\}\) for all \(x, y\).
  • Indicator random variable (\(I\{A\}\)): A random variable that equals 1 if event \(A\) occurs and 0 otherwise; its expected value equals \(\text{Pr}\{A\}\).
  • Probabilistic analysis: The analysis of an algorithm’s performance using probability theory, typically computing the expected running time over a distribution of inputs.
  • Expected running time: The average running time of an algorithm computed by averaging over all possible inputs (or random choices) weighted by their probabilities.
  • Randomized algorithm: An algorithm whose behavior depends on the input and on values produced by a random-number generator; it takes internal random choices rather than relying on the input being random.
  • Random permutation: A permutation of a set chosen uniformly at random from all \(n!\) permutations; generated in place by the RANDOMLY-PERMUTE algorithm.
  • Binary search tree (BST): A binary tree satisfying the BST property: keys in the left subtree of any node are \(\leq\) that node’s key, and keys in the right subtree are \(\geq\) that node’s key.
  • Randomly built BST: A BST obtained by inserting \(n\) distinct keys one by one in a uniformly random order; has expected height \(\Theta(\log n)\).
  • Catalan number: The number of distinct binary tree shapes on \(n\) nodes, given by \(C_n = \frac{1}{n+1}\binom{2n}{n}\).
  • Pivot (quicksort): The element chosen to partition the array during quicksort; in randomized quicksort, chosen uniformly at random from the current subarray.
  • Bucket sort: A sorting algorithm that distributes elements into \(n\) equal-width buckets, sorts each bucket with insertion sort, and concatenates results; runs in \(\Theta(n)\) expected time when input is uniformly distributed over \([0,1)\).

3. Formulas

  • Probability of complementary event: \(\text{Pr}\{\overline{A}\} = 1 - \text{Pr}\{A\}\)
  • Inclusion-exclusion (two events): \(\text{Pr}\{A \cup B\} = \text{Pr}\{A\} + \text{Pr}\{B\} - \text{Pr}\{A \cap B\}\)
  • Expected value: \(E[X] = \displaystyle\sum_x x \cdot \text{Pr}\{X = x\}\)
  • Linearity of expectation: \(E[X + Y] = E[X] + E[Y]\) and \(E[\alpha X] = \alpha E[X]\)
  • Function of random variable: \(E[g(X)] = \displaystyle\sum_x g(x) \cdot \text{Pr}\{X = x\}\)
  • Independence (product rule): If \(X\) and \(Y\) are independent, then \(E[XY] = E[X] \cdot E[Y]\)
  • Indicator variable lemma: \(E[I\{A\}] = \text{Pr}\{A\}\)
  • Expected hires (hiring problem): \(E[\text{hires}] = \displaystyle\sum_{i=1}^n \frac{1}{i} = H_n = \Theta(\log n)\)
  • Probability of comparison (quicksort): \(\text{Pr}\{z_i \text{ compared with } z_j\} = \dfrac{2}{j - i + 1}\)
  • Expected comparisons (quicksort): \(E[X] = O(n \log n)\)
  • Expected bucket size squared: \(E[n_i^2] = 2 - \dfrac{1}{n}\)
  • Catalan number: \(C_n = \dfrac{1}{n+1}\dbinom{2n}{n}\)
  • Harmonic number: \(H_n = \displaystyle\sum_{i=1}^n \frac{1}{i} = \Theta(\log n)\)

4. Examples

4.1. Prove or Disprove Probability Identities (Problem Set 7, Task 1)

For each of the following equalities, either prove it using the axioms of probability and basic set identities, or provide a counterexample:

(a) \(\text{Pr}\{A \setminus B\} = \text{Pr}\{A\} - \text{Pr}\{B\}\)

(b) \(\text{Pr}\{A \cup (B \cap C)\} = \text{Pr}\{(A \cap B) \cup C\}\)

Click to see the solution

(a) \(\text{Pr}\{A \setminus B\} = \text{Pr}\{A\} - \text{Pr}\{B\}\):

This is false in general. We provide a counterexample.

Let \(S = \{1, 2, 3\}\) with uniform distribution (\(\text{Pr}\{s\} = \frac{1}{3}\) for each \(s\)). Let \(A = \{1, 2\}\) and \(B = \{2, 3\}\).

Then:

  • \(\text{Pr}\{A \setminus B\} = \text{Pr}\{\{1\}\} = \frac{1}{3}\)
  • \(\text{Pr}\{A\} - \text{Pr}\{B\} = \frac{2}{3} - \frac{2}{3} = 0\)

Since \(\frac{1}{3} \neq 0\), the equality fails.

The correct formula is: \(\text{Pr}\{A \setminus B\} = \text{Pr}\{A\} - \text{Pr}\{A \cap B\}\), which follows from \(A = (A \setminus B) \cup (A \cap B)\) with the two parts disjoint.

(b) \(\text{Pr}\{A \cup (B \cap C)\} = \text{Pr}\{(A \cap B) \cup C\}\):

This is false in general. We provide a counterexample.

Let \(S = \{1, 2, 3, 4\}\) with uniform distribution. Let \(A = \{1\}\), \(B = \{2\}\), \(C = \{3\}\).

  • \(B \cap C = \{2\} \cap \{3\} = \emptyset\), so \(A \cup (B \cap C) = \{1\}\), thus \(\text{Pr}\{A \cup (B \cap C)\} = \frac{1}{4}\).
  • \(A \cap B = \{1\} \cap \{2\} = \emptyset\), so \((A \cap B) \cup C = \{3\}\), thus \(\text{Pr}\{(A \cap B) \cup C\} = \frac{1}{4}\).

This happens to be equal here. Let’s try \(A = \{1, 2\}\), \(B = \{2, 3\}\), \(C = \{3, 4\}\):

  • \(B \cap C = \{3\}\), so \(A \cup (B \cap C) = \{1, 2, 3\}\): \(\text{Pr} = \frac{3}{4}\).
  • \(A \cap B = \{2\}\), so \((A \cap B) \cup C = \{2, 3, 4\}\): \(\text{Pr} = \frac{3}{4}\).

Equal again. Let’s try \(A = \{1\}\), \(B = \emptyset\), \(C = \{2, 3, 4\}\):

  • \(B \cap C = \emptyset\), so \(A \cup (B \cap C) = \{1\}\): \(\text{Pr} = \frac{1}{4}\).
  • \(A \cap B = \emptyset\), so \((A \cap B) \cup C = \{2, 3, 4\}\): \(\text{Pr} = \frac{3}{4}\).

Since \(\frac{1}{4} \neq \frac{3}{4}\), the equality fails.

Answer: (a) False — counterexample given above. (b) False — counterexample with \(A = \{1\}\), \(B = \emptyset\), \(C = \{2,3,4\}\) in a 4-element uniform space.

4.2. Prove or Disprove Expectation Identities (Problem Set 7, Task 2)

For each of the following, either prove it or provide a counterexample. Assume all random variables take real values and all expected values exist.

(a) \(E\!\left[\dfrac{X + Y + Z}{3}\right] = \dfrac{E[X] + E[Y] + E[Z]}{3}\)

(b) \(E[\sqrt{XY}] = \sqrt{E[X]\,E[Y]}\) when \(X\) and \(Y\) are independent.

(c) \(E[\max(X, Y)] = \max(E[X], E[Y])\)

(d) \(E[X^3] = (E[X])^3\)

Click to see the solution

(a) \(E\!\left[\dfrac{X + Y + Z}{3}\right] = \dfrac{E[X] + E[Y] + E[Z]}{3}\):

This is true. By linearity of expectation: \[E\!\left[\frac{X + Y + Z}{3}\right] = \frac{1}{3} E[X + Y + Z] = \frac{1}{3}(E[X] + E[Y] + E[Z]) = \frac{E[X] + E[Y] + E[Z]}{3}\]

(b) \(E[\sqrt{XY}] = \sqrt{E[X]\,E[Y]}\) when \(X\) and \(Y\) are independent:

This is false in general, even with independence.

Counterexample: Let \(X\) and \(Y\) be independent copies of a variable taking values \(1\) and \(4\), each with probability \(\frac{1}{2}\).

Then:

  • \(E[X] = E[Y] = \frac{1}{2}(1) + \frac{1}{2}(4) = \frac{5}{2}\)
  • \(\sqrt{E[X] \cdot E[Y]} = \sqrt{\frac{5}{2} \cdot \frac{5}{2}} = \frac{5}{2}\)
  • \(E[\sqrt{XY}] = E[\sqrt{X}]\cdot E[\sqrt{Y}]\) (by independence) \(= \left(\frac{1}{2}\cdot 1 + \frac{1}{2}\cdot 2\right)^2 = \left(\frac{3}{2}\right)^2 = \frac{9}{4}\)

Since \(\frac{9}{4} \neq \frac{5}{2}\), the equality fails.

The correct statement would use Jensen’s inequality: since \(\sqrt{\cdot}\) is concave, \(E[\sqrt{Z}] \leq \sqrt{E[Z]}\).
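
The counterexample is quick to verify numerically. A Python sketch enumerating the four equally likely \((X, Y)\) pairs:

```python
from itertools import product

vals = [1, 4]  # X and Y each take 1 or 4 with probability 1/2, independently

E_X = sum(vals) / 2                                            # E[X] = 5/2
rhs = (E_X * E_X) ** 0.5                                       # sqrt(E[X] E[Y]) = 5/2
lhs = sum((x * y) ** 0.5 for x, y in product(vals, vals)) / 4  # E[sqrt(XY)] = 9/4
```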

(c) \(E[\max(X, Y)] = \max(E[X], E[Y])\):

This is false in general.

Counterexample: Let \(X\) and \(Y\) be fair coin flips: \(X\) is 0 or 1 with equal probability, and independently \(Y\) is 0 or 1 with equal probability.

  • \(E[X] = E[Y] = \frac{1}{2}\), so \(\max(E[X], E[Y]) = \frac{1}{2}\).
  • \(\max(X, Y) = 0\) only when \(X = Y = 0\) (probability \(\frac{1}{4}\)); otherwise \(\max(X,Y) = 1\).
  • \(E[\max(X, Y)] = 0 \cdot \frac{1}{4} + 1 \cdot \frac{3}{4} = \frac{3}{4}\).

Since \(\frac{3}{4} \neq \frac{1}{2}\), the equality fails. In general, \(E[\max(X,Y)] \geq \max(E[X], E[Y])\).

(d) \(E[X^3] = (E[X])^3\):

This is false in general.

Counterexample: Let \(X\) take values \(-1\) and \(1\) with equal probability \(\frac{1}{2}\).

  • \(E[X] = \frac{1}{2}(-1) + \frac{1}{2}(1) = 0\), so \((E[X])^3 = 0\).
  • \(X^3 = -1\) when \(X = -1\) and \(X^3 = 1\) when \(X = 1\), so \(E[X^3] = \frac{1}{2}(-1) + \frac{1}{2}(1) = 0\).

This happens to be equal here, but the equality is coincidental. Let \(X\) take values 0 and 2 with equal probability:

  • \(E[X] = 1\), so \((E[X])^3 = 1\).
  • \(E[X^3] = \frac{1}{2}(0) + \frac{1}{2}(8) = 4 \neq 1\).

The equality \(E[X^3] = (E[X])^3\) fails. In general \(E[g(X)] \neq g(E[X])\) for nonlinear \(g\); when \(g\) is convex or concave, Jensen's inequality even determines the direction of the discrepancy.

Answer: (a) True by linearity. (b) False — counterexample given. (c) False — counterexample given. (d) False — counterexample given.

4.3. Modular-Search Algorithm (Problem Set 7, Task 3)

Consider a Modular-Search algorithm that traverses an array using modular arithmetic: fix a constant step \(s\) (e.g., \(s = 17\)). Start at index 1, then visit indices \(1, 1+s, 1+2s, \ldots\) (reducing modulo \(n\) to stay in \([1..n]\)). When \(\gcd(n, s) = 1\), every index is visited exactly once over \(n\) steps. The algorithm stops as soon as it finds an occurrence of the target value \(x\).

The array \(A[1..n]\) contains exactly \(k\) copies of \(x\).

(a) Give a high-level description and pseudocode for Modular-Search.

(b) Analyze worst-case, best-case, and expected running time.

(c) Write pseudocode for Randomized-Search (using a random permutation instead of a fixed step).

(d) Derive the expected running time of Randomized-Search.

(e) Compare Modular-Search and Randomized-Search: when is the deterministic algorithm slow, and how does randomization help?

Click to see the solution

Key Concept: Modular-Search is a deterministic traversal; Randomized-Search uses a random permutation to avoid adversarial worst cases.

(a) Pseudocode for Modular-Search:

MODULAR-SEARCH(A, n, x, s)
1  idx = 1
2  for step = 1 to n
3      if A[idx] == x
4          return idx       // found x
5      idx = (idx - 1 + s) mod n + 1   // advance by s modulo n (1-indexed)
6  return NOT_FOUND

High-level description: We probe the array at positions \(1, 1+s, 1+2s, \ldots\) (all modulo \(n\), 1-indexed). When \(\gcd(n, s) = 1\), this visits all \(n\) positions before repeating. We stop at the first position where \(A[idx] = x\).

(b) Running time analysis:

Worst case: An adversary places all \(k\) copies of \(x\) at the last \(k\) positions that Modular-Search visits. Then we must examine \(n - k + 1\) positions before finding \(x\).

\[T_{\text{worst}} = \Theta(n - k + 1)\]

For \(k = 0\): always \(\Theta(n)\). For constant \(k\): still \(\Theta(n)\).

Best case: The very first position visited contains a copy of \(x\). We stop after 1 step:

\[T_{\text{best}} = \Theta(1)\]

Expected case (assuming the \(k\) positions of \(x\) are uniformly random):

Since the traversal visits all \(n\) positions in some fixed order and the \(k\) positions holding \(x\) are uniformly random, the first occurrence of \(x\) along the traversal is the minimum of \(k\) uniformly random positions among \(n\). Its expected value is \(\frac{n+1}{k+1}\), so the expected number of steps until the first hit is:

\[E[T] = \frac{n+1}{k+1} = \Theta\!\left(\frac{n}{k}\right)\]

(c) Pseudocode for Randomized-Search:

RANDOMIZED-SEARCH(A, n, x)
1  pi = [1..n]                  // array of candidate indices
2  RANDOMLY-PERMUTE(pi, n)      // shuffle the indices uniformly at random
3  for i = 1 to n
4      if A[pi[i]] == x
5          return pi[i]         // found x
6  return NOT_FOUND

(d) Expected running time of Randomized-Search:

For any fixed array with exactly \(k\) copies of \(x\), the random permutation \(\pi\) places the \(k\) occurrences at uniformly random positions within the permutation order. Let \(T\) be the position of the first occurrence of \(x\) in the permuted order.

Using indicator variables: let \(X_i = I\{\text{no copy of } x \text{ appears in } \pi(1), \ldots, \pi(i)\}\) for \(0 \leq i \leq n-1\), so that \(T = \sum_{i=0}^{n-1} X_i\). The first \(i\) positions avoid all \(k\) copies of \(x\) with probability \(\binom{n-k}{i}\big/\binom{n}{i}\), so:

\[E[T] = \sum_{i=0}^{n-1} \text{Pr}\{\text{first } i \text{ positions contain no } x\} = \sum_{i=0}^{n-k} \frac{\binom{n-k}{i}}{\binom{n}{i}} = \frac{n+1}{k+1} = \Theta\!\left(\frac{n}{k}\right)\]

Equivalently: the first occurrence of \(x\) in a random permutation of \(n\) elements, \(k\) of which are “hits”, has expected position \(\frac{n+1}{k+1}\).

This holds for any fixed input, since randomness comes from the permutation, not the input.
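
The identity behind this expectation can be verified with exact rational arithmetic. A Python sketch checking \(\sum_{i=0}^{n-k} \binom{n-k}{i}/\binom{n}{i} = \frac{n+1}{k+1}\) for small \(n\) (the ranges are arbitrary):

```python
from fractions import Fraction
from math import comb

def expected_first_hit(n, k):
    """E[T] via tail sums: Pr{T > i} = C(n-k, i) / C(n, i)."""
    return sum(Fraction(comb(n - k, i), comb(n, i)) for i in range(n - k + 1))

all_match = all(expected_first_hit(n, k) == Fraction(n + 1, k + 1)
                for n in range(1, 12) for k in range(1, n + 1))
```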

(e) Comparison:

When is Modular-Search slow? An adversary who knows the step \(s\) can arrange the \(k\) copies of \(x\) at the last \(k\) positions in the modular order, forcing \(\Theta(n)\) comparisons regardless of \(k\).

How does randomization help? Randomized-Search generates a new random permutation for each run. Even if an adversary knows the algorithm, they cannot predict the permutation, so they cannot arrange the data to force the worst case. For any fixed input with \(k\) copies of \(x\), the expected time is \(\Theta(n/k)\) — proportionally faster for larger \(k\), and \(\Theta(n)\) only when \(k = O(1)\).

Answer: Modular-Search has worst-case \(\Theta(n)\) (controllable by an adversary); Randomized-Search has expected time \(\Theta(n/k)\) against any fixed input.

4.4. Determining Uniform Distributions (Lecture 7, Example 1)

For each of the following, determine whether the probability distribution is uniform:

(a) Rolling a fair six-sided die.

(b) Rolling two fair six-sided dice and looking at the sum of the faces.

(c) Choosing a random student ID from a class of \(n\) students.

Click to see the solution

(a) Rolling a fair six-sided die:

The sample space is \(S = \{1, 2, 3, 4, 5, 6\}\). Each face has probability \(\frac{1}{6}\), so each outcome is equally likely.

Uniform? Yes.

(b) Sum of two fair dice:

The sample space for individual rolls is \(\{1,\ldots,6\} \times \{1,\ldots,6\}\) (36 equally likely pairs). But if we look at the sum, the sample space is \(\{2, 3, \ldots, 12\}\), and these values are not equally likely. For instance:

  • Sum = 2 occurs only 1 way: \((1,1)\) — probability \(\frac{1}{36}\)
  • Sum = 7 occurs 6 ways: \((1,6),(2,5),(3,4),(4,3),(5,2),(6,1)\) — probability \(\frac{6}{36}\)

Uniform? No.
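The 36-pair count is easy to tabulate in a few lines; a sketch:

```python
from itertools import product
from collections import Counter

# count how many of the 36 equally likely ordered pairs give each sum
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))

print(counts[2], counts[7])   # → 1 6
```

Since different sums occur in different numbers of ways (1 way for 2, 6 ways for 7), the distribution over \(\{2, \ldots, 12\}\) cannot be uniform.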

(c) Choosing a random student ID:

By default, “choosing at random” means uniformly at random. Each of the \(n\) student IDs has probability \(\frac{1}{n}\).

Uniform? Yes (by convention).

4.5. Dependent Random Variables (Lecture 7, Example 2)

Give an example of two random variables that are not independent.

Click to see the solution

Key Concept: Two variables are dependent if knowing the value of one changes the probability distribution of the other.

Let \(S = \{H, T\}\) be the sample space of a single fair coin toss. Define: \[X(s) = \begin{cases} 1 & s = H \\ 0 & s = T \end{cases}, \qquad Y(s) = \begin{cases} 1 & s = T \\ 0 & s = H \end{cases}\]

Then:

  • \(\text{Pr}\{X = 1\} = \frac{1}{2}\) and \(\text{Pr}\{Y = 1\} = \frac{1}{2}\)
  • \(\text{Pr}\{X = 1 \text{ and } Y = 1\} = \text{Pr}\{\text{both H and T}\} = 0\)

But \(\text{Pr}\{X = 1\} \cdot \text{Pr}\{Y = 1\} = \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4} \neq 0\).

Since the joint probability does not equal the product of the individual probabilities, \(X\) and \(Y\) are dependent. Intuitively: if \(X = 1\) (heads), then \(Y\) must be 0 — knowing \(X\) tells you exactly what \(Y\) is.

Answer: \(X = I\{\text{heads}\}\) and \(Y = I\{\text{tails}\}\) on a single coin toss are not independent.
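The independence test is mechanical enough to script over the two-outcome sample space; a sketch using exact fractions:

```python
from fractions import Fraction

outcomes = ["H", "T"]          # fair coin: each outcome has probability 1/2
X = {"H": 1, "T": 0}           # indicator of heads
Y = {"H": 0, "T": 1}           # indicator of tails

p = Fraction(1, 2)
pr_x1 = sum(p for s in outcomes if X[s] == 1)                   # Pr{X = 1} = 1/2
pr_y1 = sum(p for s in outcomes if Y[s] == 1)                   # Pr{Y = 1} = 1/2
pr_both = sum(p for s in outcomes if X[s] == 1 and Y[s] == 1)   # joint = 0

assert pr_both != pr_x1 * pr_y1   # 0 != 1/4, so X and Y are dependent
```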

4.6. Expected Number of Heads (Lecture 7, Example 3)

When tossing a fair coin \(n\) times, what is the expected number of heads?

Click to see the solution

Key Concept: Decompose the count into indicator variables and use linearity of expectation.

  1. Define indicator variables: For each toss \(i\), let: \[X_i = I\{\text{the } i\text{-th toss results in heads}\}\]
  2. Express total heads: The total number of heads is \(X = \sum_{i=1}^n X_i\).
  3. Apply linearity of expectation and the indicator lemma: \[E[X] = E\left[\sum_{i=1}^n X_i\right] = \sum_{i=1}^n E[X_i] = \sum_{i=1}^n \text{Pr}\{\text{the } i\text{-th toss is heads}\} = \sum_{i=1}^n \frac{1}{2} = \frac{n}{2}\]

Answer: The expected number of heads in \(n\) tosses is \(\frac{n}{2}\).
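For small \(n\), the linearity argument can be confirmed by brute force over all \(2^n\) equally likely sequences; a sketch (`expected_heads` is our own helper name):

```python
from itertools import product
from fractions import Fraction

def expected_heads(n):
    """Exact expected #heads, averaged over all 2^n equally likely outcomes."""
    outcomes = list(product("HT", repeat=n))
    return Fraction(sum(seq.count("H") for seq in outcomes), len(outcomes))

# agrees with n/2 for every small n
assert all(expected_heads(n) == Fraction(n, 2) for n in range(1, 11))
```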

4.7. Probabilistic Analysis of the Hiring Problem (Lecture 7, Example 4)

When evaluating \(n\) candidates (each arriving in a uniformly random order), what is the expected number of times we hire (replace) an agent?

Click to see the solution

Key Concept: Use indicator variables. The \(i\)-th candidate is hired if and only if they are the best among the first \(i\) candidates.

  1. Define indicator variables: For each \(i\), let: \[X_i = I\{\text{the } i\text{-th candidate is hired}\}\]

  2. Compute probabilities: The \(i\)-th candidate is hired iff they are the best of \(\{1, \ldots, i\}\). Since the order is uniformly random, each of the first \(i\) candidates is equally likely to be the best: \[\text{Pr}\{X_i = 1\} = \frac{1}{i}\]

  3. Expected total hires: \[E[X] = \sum_{i=1}^n E[X_i] = \sum_{i=1}^n \frac{1}{i} = H_n = \Theta(\log n)\]

    Concretely: the 1st candidate is always hired (\(\text{Pr} = 1\)), the 2nd with probability \(\frac{1}{2}\), the 3rd with probability \(\frac{1}{3}\), and so on.

Answer: The expected number of hires is \(\Theta(\log n)\), much less than the worst-case \(n\).
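The harmonic-number answer can be verified exactly for small \(n\) by enumerating all \(n!\) candidate orders and counting "new best so far" events; a sketch (`expected_hires` is a hypothetical helper name):

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

def expected_hires(n):
    """Average number of hires over all n! equally likely candidate orders."""
    total = 0
    for order in permutations(range(1, n + 1)):
        best = 0
        for rank in order:
            if rank > best:          # candidate beats everyone seen so far: hire
                best = rank
                total += 1
    return Fraction(total, factorial(n))

# matches the harmonic number H_n = 1 + 1/2 + ... + 1/n
for n in range(1, 8):
    assert expected_hires(n) == sum(Fraction(1, i) for i in range(1, n + 1))
```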

4.8. Expected Score in a Coin Game (Lecture 7, Task 1)

Consider a game in which we toss a pair of coins. For each heads you gain 3 points, and for each tails you lose 2 points. What is the expected value of your score?

Click to see the solution

Key Concept: Define a random variable over the sample space, compute \(E[X] = \sum_x x \cdot \text{Pr}\{X = x\}\).

  1. Define the sample space: \[S = \{HH, HT, TH, TT\}\] Assuming a fair coin, each outcome has probability \(\frac{1}{4}\).
  2. Define the score random variable \(X\):
    • \(X(HH) = 3 + 3 = +6\) (two heads)
    • \(X(HT) = 3 - 2 = +1\) (one head, one tail)
    • \(X(TH) = 3 - 2 = +1\) (one head, one tail)
    • \(X(TT) = -2 - 2 = -4\) (two tails)
  3. Compute the expected value: \[E[X] = 6 \cdot \frac{1}{4} + 1 \cdot \frac{1}{4} + 1 \cdot \frac{1}{4} + (-4) \cdot \frac{1}{4} = \frac{6 + 1 + 1 - 4}{4} = \frac{4}{4} = 1\]

Answer: The expected score is \(1\) point.
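The same computation in code, weighting each of the four outcomes by \(\frac{1}{4}\); a sketch:

```python
from fractions import Fraction
from itertools import product

# score of each outcome: +3 per heads, -2 per tails
scores = [sum(3 if c == "H" else -2 for c in outcome)
          for outcome in product("HT", repeat=2)]     # HH, HT, TH, TT

expected = Fraction(sum(scores), len(scores))         # each outcome has prob 1/4
assert expected == 1                                  # E[X] = 1 point
```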

4.9. Expected Height of a Randomly Chosen BST (Lecture 7, Task 2)

What is the expected height of a randomly chosen binary search tree?

(a) With 5 nodes.

(b) With \(n\) nodes (general formula).

Click to see the solution

Key Concept: A randomly chosen BST means we pick uniformly at random among all valid BST shapes (not among all insertion orders). The number of distinct BST shapes on \(n\) nodes is the \(n\)-th Catalan number \(C_n\).

(a) For \(n = 5\):

Step 1 — Count the shapes. The number of BST shapes on 5 nodes is: \[C_5 = \frac{1}{6}\binom{10}{5} = \frac{252}{6} = 42\]

The lecture scratchpad verifies this by grouping shapes by root value. For a BST on keys \(\{1,2,3,4,5\}\), if key \(k\) is the root, then the left subtree has \(k-1\) nodes (\(C_{k-1}\) shapes) and the right subtree has \(5-k\) nodes (\(C_{5-k}\) shapes):

| Root \(k\) | Left shapes | Right shapes | Subtotal |
|---|---|---|---|
| 1 | \(C_0 = 1\) | \(C_4 = 14\) | 14 |
| 2 | \(C_1 = 1\) | \(C_3 = 5\) | 5 |
| 3 | \(C_2 = 2\) | \(C_2 = 2\) | 4 |
| 4 | \(C_3 = 5\) | \(C_1 = 1\) | 5 |
| 5 | \(C_4 = 14\) | \(C_0 = 1\) | 14 |
| **Total** | | | **42** |

Step 2 — Heights. For \(n = 5\) nodes, heights range from 2 (minimum: a nearly complete tree) to 4 (maximum: a linear chain). Enumerating all 42 shapes (easily done computationally) gives 6 shapes of height 2, 20 of height 3, and 16 of height 4 (the chains, one per sequence of left/right steps), so the exact expected height is \[\frac{2 \cdot 6 + 3 \cdot 20 + 4 \cdot 16}{42} = \frac{136}{42} = \frac{68}{21} \approx 3.24\]
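The enumeration of shapes is small enough to script: generate every binary tree shape on \(n\) nodes recursively, compute each height (counted in edges, so a single node has height 0), and average. A sketch; `shapes` and `height` are our own helper names:

```python
from fractions import Fraction

def shapes(n):
    """All binary tree shapes with n nodes, as nested (left, right) tuples."""
    if n == 0:
        return [None]
    return [(l, r) for k in range(n)          # k nodes go to the left subtree
                   for l in shapes(k)
                   for r in shapes(n - 1 - k)]

def height(t):
    """Height in edges: empty tree is -1, a single node is 0."""
    return -1 if t is None else 1 + max(height(t[0]), height(t[1]))

trees = shapes(5)
avg = Fraction(sum(height(t) for t in trees), len(trees))
print(len(trees), avg)   # → 42 68/21
```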

(b) For general \(n\):

A uniformly random BST shape (one of \(C_n\) shapes chosen uniformly) corresponds to a random binary plane tree. It is known from combinatorics that the expected height of such a tree grows as \(\Theta(\sqrt{n})\).

This is significantly worse than the \(\Theta(\log n)\) expected height of a randomly built BST (from a random insertion order). The key insight: the uniform distribution over shapes is not the same as the uniform distribution over insertion orders, and the two give very different height behavior.

Answer: For \(n = 5\): \(C_5 = 42\) shapes, heights 2–4, expected height \(\frac{68}{21} \approx 3.24\). For general \(n\): expected height \(= \Theta(\sqrt{n})\).

4.10. Expected Height of a Randomly Built BST (Lecture 7, Task 3)

What is the expected height of a randomly built binary search tree obtained by inserting keys one by one into an initially empty tree?

(a) 5 keys.

(b) \(n\) keys (general formula).

Click to see the solution

Key Concept: A randomly built BST inserts keys in a uniformly random permutation order — each of the \(n!\) orderings is equally likely. The resulting height depends heavily on which key becomes the root (and then the sub-roots recursively).

Lecture worked example (\(n = 4\)): The professor enumerated all \(4! = 24\) permutations of \(\{1, 2, 3, 4\}\) to illustrate the method. The result is:

\[E[h] = 2 \cdot \frac{16}{24} + 3 \cdot \frac{8}{24} = \frac{32}{24} + \frac{24}{24} = \frac{56}{24} = 2\tfrac{1}{3}\]

because 16 permutations yield trees of height 2 and 8 permutations yield trees of height 3.

(a) For \(n = 5\):

There are \(5! = 120\) permutations of \(\{1, 2, 3, 4, 5\}\). The same enumeration approach applies (greatly aided by a computer). The key groups are:

  • The first key inserted becomes the root. If it is 1 or 5, one subtree is empty and all four remaining keys fall on the other side, so the tree tends to be tall.
  • Middle root values (2, 3, or 4) create a more balanced split.

The expected height for \(n = 5\) is exactly \(\frac{14}{5} = 2.8\) (between 2 and 4), computed by enumerating and averaging over all 120 permutations.
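The permutation average is easy to reproduce by building each BST explicitly; a sketch with hypothetical helpers `built_height` and `expected_built_height` (height counted in edges):

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

def built_height(order):
    """Height in edges of the BST produced by inserting keys in this order."""
    tree = {order[0]: [None, None]}   # key -> [left child key, right child key]
    root, h = order[0], 0
    for key in order[1:]:
        node, depth = root, 0
        while True:
            side = 0 if key < node else 1
            depth += 1
            if tree[node][side] is None:      # found the insertion point
                tree[node][side] = key
                tree[key] = [None, None]
                h = max(h, depth)
                break
            node = tree[node][side]
    return h

def expected_built_height(n):
    """Exact average height over all n! equally likely insertion orders."""
    total = sum(built_height(p) for p in permutations(range(n)))
    return Fraction(total, factorial(n))

print(expected_built_height(4), expected_built_height(5))   # → 7/3 14/5
```

For \(n = 4\) this reproduces the lecture's \(2\tfrac{1}{3}\), and for \(n = 5\) it gives \(\frac{14}{5} = 2.8\).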

(b) For general \(n\):

It is proven (Cormen et al. 2022, §12.4) that the expected height of a randomly built BST on \(n\) nodes is \(\Theta(\log n)\).

The intuition: in a random permutation, the root is equally likely to be any of the \(n\) keys. Once the root is chosen, the left and right subtrees are themselves randomly built BSTs of expected sizes close to \(\frac{n}{2}\) each (on average), leading to a balanced recursive structure and thus \(O(\log n)\) height.

Answer: For \(n = 5\): expected height \(= \frac{14}{5} = 2.8\) exactly (computed by enumeration). For general \(n\): expected height \(= \Theta(\log n)\).